AWS Server Issues in Northern Virginia: What Was the Problem?
It was Wednesday, November 25, 2020, a day like any other in Northern Virginia, in the eastern United States, when Amazon Web Services (AWS) suffered an outage that caused significant problems for many online services.
After analyzing the issue in detail, the company's Seattle headquarters explained that the outage was confined to the Northern Virginia region and occurred after a “small addition of capacity” to the front-end fleet of Kinesis servers.
This is no small inconvenience, considering that Amazon Kinesis, the AWS service for real-time processing of streaming data, is used directly by customers and also underpins products from large companies such as Adobe Spark, Roku, Flickr, and Autodesk. As a result, nearly all of the major cloud-based applications that rely on Kinesis for their back end were affected by the disruption.
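To give a sense of what direct customer use of Kinesis looks like, here is a minimal sketch of publishing one record to a stream with the AWS SDK for Python (boto3). The stream name, partition key, and sample payload are placeholders invented for illustration; only the boto3 calls themselves are real.

```python
"""Minimal example of writing to a Kinesis stream (placeholder names)."""
import json

import boto3

# Northern Virginia is the us-east-1 region; credentials are assumed to be
# configured in the environment as usual for the AWS SDK.
kinesis = boto3.client("kinesis", region_name="us-east-1")

# Publish one event; Kinesis uses the PartitionKey to choose the shard.
response = kinesis.put_record(
    StreamName="example-clickstream",                       # placeholder
    Data=json.dumps({"event": "page_view"}).encode("utf-8"),
    PartitionKey="user-42",                                  # placeholder
)
print(response["ShardId"], response["SequenceNumber"])
```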
Suffice it to say that the problems also hit cryptocurrency portals, which failed to process transactions, and streaming and podcast services, which limited users’ access to their accounts. Among the sites that reported issues on DownDetector were Ring, Prime Music, Pokémon Go, Roku, MeetUp.com, League of Legends, Ancestry.com, Chime, and more.
In its post-incident summary, the cloud giant explained how that “small addition of capacity” to the front-end fleet of Kinesis servers led to the outage.
“The trigger, though not the root cause of the event,” the company is keen to point out, “was a relatively small addition of capacity that began at 2:44 a.m. and finished at 3:47 a.m. Kinesis has a large number of back-end cell clusters that process the streams. These are Kinesis’s workhorses, providing distribution, access, and scalability for stream processing. Streams are spread across the back end through a sharding mechanism owned by a fleet of front-end servers. A back-end cluster owns many shards and provides a consistent unit of scaling and fault isolation. The front end’s job is small but important: it handles authentication, throttling, and the routing of requests to the correct stream shards on the back-end clusters.”
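To make the architecture described in the quote a little more concrete, the sketch below shows how a front-end server might use a shard map to route a record to the back-end cluster that owns its shard. The class name and the hash-based ownership scheme are assumptions made purely for illustration, not AWS’s internal design; the point is simply that a front end whose shard map never finished building has nowhere to send requests.

```python
"""Illustrative front-end routing via a shard map (not AWS's actual code)."""
from dataclasses import dataclass
import hashlib


@dataclass
class ShardMap:
    """Maps each shard id to the back-end cluster that owns it."""
    owners: dict

    def cluster_for(self, shard_id: str) -> str:
        # An incomplete map, like the ones left behind during the outage,
        # has no entry to look up, so routing fails.
        if shard_id not in self.owners:
            raise LookupError(f"no back-end cluster known for {shard_id}")
        return self.owners[shard_id]


def shard_for(stream: str, partition_key: str, num_shards: int) -> str:
    """Pick a shard by hashing the partition key (illustrative scheme)."""
    digest = hashlib.md5(f"{stream}:{partition_key}".encode()).hexdigest()
    return f"{stream}-shard-{int(digest, 16) % num_shards}"


# Route one record through the toy front end.
shard_map = ShardMap({
    "orders-shard-0": "backend-cluster-a",
    "orders-shard-1": "backend-cluster-b",
})
shard = shard_for("orders", partition_key="customer-42", num_shards=2)
print(f"{shard} -> {shard_map.cluster_for(shard)}")
```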
“At 9:39 a.m.,” they continue, “we were able to confirm that the root cause was not memory pressure. Rather, the new capacity had caused the number of threads on all servers in the fleet to exceed the maximum allowed by an operating system configuration. Once this limit was exceeded, cache construction could not complete, and front-end servers were left with useless shard maps that made them unable to route requests to the back-end clusters.”
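The summary quoted here does not name the exact operating system setting that was exceeded, but on Linux a few well-known knobs cap how many threads a host can run. The sketch below, which only reads standard Linux interfaces, shows how an operator might survey those limits before adding capacity; the idea of checking them fleet-wide is an assumption for illustration, not part of AWS’s account.

```python
#!/usr/bin/env python3
"""Survey common Linux limits on thread counts (illustrative sketch only)."""
import resource


def read_first_int(path: str):
    """Return the first integer in a /proc file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


# System-wide ceiling on the total number of threads (tasks) on the host.
threads_max = read_first_int("/proc/sys/kernel/threads-max")

# Maximum PID/TID value; the kernel cannot assign more task ids than this.
pid_max = read_first_int("/proc/sys/kernel/pid_max")

# Per-user ceiling on processes and threads (what `ulimit -u` reports).
soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)

print(f"kernel.threads-max      : {threads_max}")
print(f"kernel.pid_max          : {pid_max}")
print(f"RLIMIT_NPROC (ulimit -u): soft={soft_nproc} hard={hard_nproc}")
```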
In short, the problem was triggered by an attempt to increase the system’s capacity: adding new servers to Amazon’s dominant cloud computing network set off a chain of cascading errors that disrupted several online services.
Acknowledging one’s mistakes, however, is essential, and in this case the cloud giant was quick to apologize to its customers. “We will do everything we can to learn from this event and use it to improve further,” the company said.